Saving Data, and Tools for Large or Concurrent Jobs

Five use cases are considered:

  • saving output in common formats for sharing (CSV, Excel)
  • saving output in binary formats for further analysis (pickle, HDF5, SQL)
  • processing a large video, saving results one frame at a time
  • processing many videos in parallel
  • accessing partially complete results during analysis

Saving data

In the simplest case, you can locate the features in every frame of a movie and collect the results in a single variable.


In [4]:
import mr
import pandas as pd  # used below for HDF5 storage and SQL queries

In [10]:
v = mr.Video('/home/dallan/mr/mr/tests/water/bulk-water.mov')

In [12]:
f = mr.batch(v[:3], 11, 3000)


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

The result is a DataFrame, which can be saved in formats convenient for sharing, like Excel


In [13]:
f.to_excel('features.xlsx')

or comma-separated values


In [14]:
f.to_csv('features.csv')

These formats are slow to read and write. Unless you are sending the file to a non-programmer, it is better to save it in a binary format.


In [37]:
f.save('features.df') # df for DataFrame -- could be any name you want
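
To read the file back in a later session, use the matching pandas function. (A minimal sketch: DataFrame.save writes a pickle, which older versions of pandas load with pd.load; newer versions use to_pickle and read_pickle instead.)

f = pd.load('features.df')  # in newer pandas: pd.read_pickle('features.df')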

Saving large jobs while they run

For large jobs, it is better to save the features one frame at a time as the job proceeds. If the job is interrupted, partial progress will be saved. And the job requires only enough memory to process one frame at a time -- it need not hold all the frames' data.

batch can do this in two different ways: using an HDF5 file (a fast binary format) or a SQL database.

HDF5

For HDF5, we open an HDF5 file using pandas, and pass it to batch.


In [20]:
store = pd.HDFStore('data.h5')
f = mr.batch(v[:3], 11, 3000, store=store, table='bulk_water/features')
# table can take any unique name -- even slashes and spaces are OK


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

batch saves the data one frame at a time, discarding each frame's data before it begins the next one. In this way, memory is conserved and long videos can be processed. At the end, batch loads the data out of the HDF5 file and returns it in the variable f.
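
The same frame-by-frame pattern is easy to emulate for your own per-frame computations. A minimal sketch (illustrative only, not mr's actual implementation; locate_one_frame is a hypothetical stand-in for whatever per-frame analysis you run):

import pandas as pd

store = pd.HDFStore('my_data.h5')
for i, image in enumerate(frames):      # frames: any iterable of images
    features = locate_one_frame(image)  # hypothetical per-frame analysis
    features['frame'] = i               # tag each row with its frame number
    store.append('my_table', features)  # append rows, keeping earlier ones
result = store['my_table']              # load the full table at the end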

If you wish to run jobs simultaneously in several Python sessions, you might want to leave the data in the store and retrieve it later, in part or in full. Use do_not_return=True.


In [22]:
mr.batch(v[:3], 11, 3000, store=store, table='bulk_water/features', do_not_return=True)
# This returns nothing.


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

We can load it from the store later.


In [25]:
f = store['bulk_water/features']
f.head()


Out[25]:
x y mass size ecc signal ep frame
2 36.048647 8.120968 3844 2.771191 0.146392 22.279131 0.370538 0
3 67.232830 7.869478 3509 2.570799 0.053752 23.279131 0.412061 0
5 430.957784 7.319437 5685 2.763565 0.288109 26.279131 0.315874 0
6 629.180087 8.195757 4148 3.248655 0.216354 14.279131 0.420683 0
12 552.773313 11.108589 3260 2.211168 0.118502 29.279131 0.442856 0

If it is too large, we can fetch it in part:


In [43]:
f = store.select('bulk_water/features', pd.Term('frame < 3'))
f.head()


Out[43]:
x y mass size ecc signal ep frame
2 36.048647 8.120968 3844 2.771191 0.146392 22.279131 0.370538 0
3 67.232830 7.869478 3509 2.570799 0.053752 23.279131 0.412061 0
5 430.957784 7.319437 5685 2.763565 0.288109 26.279131 0.315874 0
6 629.180087 8.195757 4148 3.248655 0.216354 14.279131 0.420683 0
12 552.773313 11.108589 3260 2.211168 0.118502 29.279131 0.442856 0

SQL

As an alternative to HDF5, we can use a SQL database. The simplest choice is sqlite, which stores an entire database in a single file.


In [33]:
import sqlite3
conn = sqlite3.connect('data.sql')
f = mr.batch(v[:3], 11, 3000, conn=conn, sql_flavor='sqlite', table='bulk_water/features')


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

A MySQL database is also supported. The mr.sql module provides a convenience function for making a MySQL database connection.


In [32]:
f = mr.batch(v[:3], 11, 3000, conn=mr.sql.connect(), sql_flavor='mysql', table='bulk_water/features')


mr.core.feature.locate:  693 local maxima, 259 of qualifying mass
mr.core.feature.batch:  Frame 0: 259 features
mr.core.feature.locate:  695 local maxima, 203 of qualifying mass
mr.core.feature.batch:  Frame 1: 203 features
mr.core.feature.locate:  702 local maxima, 184 of qualifying mass
mr.core.feature.batch:  Frame 2: 184 features
mr.core.feature.locate:  706 local maxima, 220 of qualifying mass
mr.core.feature.batch:  Frame 3: 220 features

As with HDF5, you can conserve memory using do_not_return=True.
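
For example (a sketch that simply combines the parameters demonstrated above):

# Save to the sqlite database and return nothing.
mr.batch(v[:3], 11, 3000, conn=conn, sql_flavor='sqlite',
         table='bulk_water/features', do_not_return=True)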

Accessing partial data sets without interrupting analysis

Finally, it is sometimes convenient to examine early results while the full video is still being processed. This is not possible with an HDF5 file, which does not support concurrent reading and writing, but SQL makes it possible. (Note that the slash in the table name passed to batch evidently becomes an underscore in the SQL table, bulk_water_features.)


In [36]:
partial = pd.io.sql.read_frame('select * from bulk_water_features', conn)
partial.head()


Out[36]:
x y mass size ecc signal ep frame
0 36.048647 8.120968 3844 2.771191 0.146392 22.279131 0.370538 0
1 67.232830 7.869478 3509 2.570799 0.053752 23.279131 0.412061 0
2 430.957784 7.319437 5685 2.763565 0.288109 26.279131 0.315874 0
3 629.180087 8.195757 4148 3.248655 0.216354 14.279131 0.420683 0
4 552.773313 11.108589 3260 2.211168 0.118502 29.279131 0.442856 0

Here we see the full result because this short example job has already finished.
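
While a long job is still running in another session, the same query pattern can be used to check progress. A minimal sketch, assuming the sqlite file and table names used above:

import sqlite3
import pandas as pd

conn = sqlite3.connect('data.sql')
# Count how many frames have been written to the table so far.
progress = pd.io.sql.read_frame(
    'select count(distinct frame) from bulk_water_features', conn)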